1 Data description

This project aims to identify differentially expressed proteins in the cerebral cortex of control and Down syndrome model mice exposed to context fear conditioning. The data were obtained from Ahmed MM, Dhanasekaran AR, Block A, Tong S, Costa ACS, Stasko M, et al. (2015). We can load the data set and take a look at its structure:

# requires the readxl package (dplyr, stringr, ggplot2, lme4, car, multcomp,
# gridExtra, vegan, plotly, and mixOmics are used later in this report)
library(readxl)
# replace with the full path to the Data_Cortex_Nuclear.xls file to upload the data
df <- read_excel("/media/daria/DaryaNika/IB_fall2020/Statistics_in_R/data/Data_Cortex_Nuclear.xls")
head(df)
## # A tibble: 6 x 82
##   MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N
##   <chr>      <dbl>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>   <dbl>     <dbl>   <dbl>
## 1 309_1      0.504   0.747  0.430  2.82   5.99  0.219   0.178      2.37   0.232
## 2 309_2      0.515   0.689  0.412  2.79   5.69  0.212   0.173      2.29   0.227
## 3 309_3      0.509   0.730  0.418  2.69   5.62  0.209   0.176      2.28   0.230
## 4 309_4      0.442   0.617  0.359  2.47   4.98  0.223   0.176      2.15   0.207
## 5 309_5      0.435   0.617  0.359  2.37   4.72  0.213   0.174      2.13   0.192
## 6 309_6      0.448   0.628  0.367  2.39   4.81  0.219   0.176      2.14   0.195
## # … with 72 more variables: pELK_N <dbl>, pERK_N <dbl>, pJNK_N <dbl>,
## #   PKCA_N <dbl>, pMEK_N <dbl>, pNR1_N <dbl>, pNR2A_N <dbl>, pNR2B_N <dbl>,
## #   pPKCAB_N <dbl>, pRSK_N <dbl>, AKT_N <dbl>, BRAF_N <dbl>, CAMKII_N <dbl>,
## #   CREB_N <dbl>, ELK_N <dbl>, ERK_N <dbl>, GSK3B_N <dbl>, JNK_N <dbl>,
## #   MEK_N <dbl>, TRKA_N <dbl>, RSK_N <dbl>, APP_N <dbl>, Bcatenin_N <dbl>,
## #   SOD1_N <dbl>, MTOR_N <dbl>, P38_N <dbl>, pMTOR_N <dbl>, DSCR1_N <dbl>,
## #   AMPKA_N <dbl>, NR2B_N <dbl>, pNUMB_N <dbl>, RAPTOR_N <dbl>, TIAM1_N <dbl>,
## #   pP70S6_N <dbl>, NUMB_N <dbl>, P70S6_N <dbl>, pGSK3B_N <dbl>, pPKCG_N <dbl>,
## #   CDK5_N <dbl>, S6_N <dbl>, ADARB1_N <dbl>, AcetylH3K9_N <dbl>, RRP1_N <dbl>,
## #   BAX_N <dbl>, ARC_N <dbl>, ERBB4_N <dbl>, nNOS_N <dbl>, Tau_N <dbl>,
## #   GFAP_N <dbl>, GluR3_N <dbl>, GluR4_N <dbl>, IL1B_N <dbl>, P3525_N <dbl>,
## #   pCASP9_N <dbl>, PSD95_N <dbl>, SNCA_N <dbl>, Ubiquitin_N <dbl>,
## #   pGSK3B_Tyr216_N <dbl>, SHH_N <dbl>, BAD_N <dbl>, BCL2_N <dbl>, pS6_N <dbl>,
## #   pCFOS_N <dbl>, SYP_N <dbl>, H3AcK18_N <dbl>, EGR1_N <dbl>, H3MeK4_N <dbl>,
## #   CaNA_N <dbl>, Genotype <chr>, Treatment <chr>, Behavior <chr>, class <chr>
# str(df) leads to a very long output, uncomment, if you want to take a look:
#str(df)

As we can see, the data set consists of the expression levels of 77 proteins detected in the nuclear fraction of the brain cortex. Each row contains the observations for one sample, while columns 2-78 each refer to a single protein. There are 5 additional variables (“MouseID”, “Genotype”, “Treatment”, “Behavior”, “class”) containing further information:

  • Identification number of a mouse; the suffix after "_" indicates the sample number, which ranges from 1 to 15 (MouseID)
  • Control or trisomic mouse (Genotype)
  • Mice stimulated to learn (context-shock) or not stimulated (shock-context) (Behavior)
  • Mice injected with memantine (Memantine) or with saline (Saline) (Treatment)

We can turn all non-numeric variables into factors for further analysis.

df <- df %>% 
  mutate(across(where(is.character), ~ as.factor(.x)))

1.1 Number of mice in experiment

Although the data set consists of 1080 rows, the number of animals is smaller, since 15 measurements were made for each mouse. Thus, 72 mice were examined. To confirm this, we can strip the replicate suffix ("_number") from the “MouseID” variable and count the unique values:

nrow(df)/15 == length(unique(str_replace(df$MouseID, pattern = "_\\d*", "")))
## [1] TRUE

1.2 What groups may be defined

According to the features mentioned above, 8 classes of mice can be defined:

  • c-CS-s: control mice, stimulated to learn, injected with saline
  • c-CS-m: control mice, stimulated to learn, injected with memantine
  • c-SC-s: control mice, not stimulated to learn, injected with saline
  • c-SC-m: control mice, not stimulated to learn, injected with memantine
  • t-CS-s: trisomy mice, stimulated to learn, injected with saline
  • t-CS-m: trisomy mice, stimulated to learn, injected with memantine
  • t-SC-s: trisomy mice, not stimulated to learn, injected with saline
  • t-SC-m: trisomy mice, not stimulated to learn, injected with memantine

The class of each mouse is stored in the class variable.

It is more convenient to split the MouseID variable into two variables: MouseID will hold the ID of each mouse, while MouseID_sample will hold the sample number (1-15) for each mouse:

df <- df %>%
  mutate(MouseID_sample = str_replace(MouseID, pattern = "\\d*_", ""), .before = 2,
         MouseID = str_replace(MouseID, pattern = "_\\d*", ""))

1.3 Are the groups balanced?

We can count the number of mice in each group using the dplyr functions group_by and summarise:

df %>% 
  group_by(class) %>% 
  summarise(N = length(unique(str_replace(MouseID, pattern = "_\\d*", ""))))
## # A tibble: 8 x 2
##   class      N
## * <fct>  <int>
## 1 c-CS-m    10
## 2 c-CS-s     9
## 3 c-SC-m    10
## 4 c-SC-s     9
## 5 t-CS-m     9
## 6 t-CS-s     7
## 7 t-SC-m     9
## 8 t-SC-s     9

As we can see, the number of animals per group ranges between 7 and 10. We assume that the data is reasonably balanced; however, it would be better to have more than 7 animals in the t-CS-s class.

1.4 Missing values (NAs)

nrow(df[which(!complete.cases(df)), ]) # number of rows with NA
## [1] 528

We can see that there are 528 rows with missing values. Since there are 15 replicates for each mouse, we can replace some missing values with the mean value of that protein within the same mouse:

replace_NA_grouped_mean <- function(data_with_NA){
  data_with_NA %>%
    group_by(MouseID) %>%
    # replace each NA with the mean of that protein within the same mouse
    mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
}
df <- replace_NA_grouped_mean(df)

nrow(df[which(!complete.cases(df)), ])
## [1] 525

Surprisingly, only 3 of the 528 rows with NAs are gone. This means that in most cases the values are missing for all 15 replicates of a mouse, so they cannot be replaced with a within-mouse mean. It is wiser not to discard all rows with missing values, because we have only 7-10 animals in each group.
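To see why mean imputation recovers so few values, we can count, for each protein, how many mice have all 15 replicates missing (a sketch; assumes dplyr and tidyr are loaded):

# for each mouse, flag proteins where every replicate is NA,
# then count the affected mice per protein
df %>%
  group_by(MouseID) %>%
  summarise(across(where(is.numeric), ~ all(is.na(.x)))) %>%
  ungroup() %>%
  summarise(across(where(is.logical), sum)) %>%
  tidyr::pivot_longer(everything(), names_to = "protein",
                      values_to = "mice_all_NA") %>%
  dplyr::filter(mice_all_NA > 0)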

2 BDNF_N expression in different groups

First of all, we need to check whether there are any missing values in BDNF_N and take a look at possible differences between groups in BDNF expression:

nrow(df[which(!complete.cases(df$BDNF_N)), ]) # number of rows with NA
## [1] 0
ggplot(df, aes(class, BDNF_N, color = class)) + 
  stat_summary(fun.data = "mean_cl_normal") + 
  ggtitle(label = "Mean BDNF expression level") +
  ylab(label = "BDNF expression")

Suspecting a difference between the groups, we next compare them with a one-way ANOVA.

2.1 One-way ANOVA

Because 15 repeated measurements were taken from each mouse, we first construct a linear mixed-effects model (LMM) with MouseID as a random grouping factor:

mod_BDNF <- lmer(BDNF_N ~ class + (1 | MouseID), data = df)
summary(mod_BDNF)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BDNF_N ~ class + (1 | MouseID)
##    Data: df
## 
## REML criterion at convergence: -4007.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.2210 -0.6299 -0.1010  0.5210  4.4457 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  MouseID  (Intercept) 0.001165 0.03414 
##  Residual             0.001137 0.03372 
## Number of obs: 1080, groups:  MouseID, 72
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  0.339217   0.011140  30.450
## classc-CS-s  0.003098   0.016186   0.191
## classc-SC-m -0.048272   0.015754  -3.064
## classc-SC-s -0.025825   0.016186  -1.595
## classt-CS-m -0.026485   0.016186  -1.636
## classt-CS-s -0.033757   0.017361  -1.944
## classt-SC-m -0.018154   0.016186  -1.122
## classt-SC-s -0.013179   0.016186  -0.814
## 
## Correlation of Fixed Effects:
##             (Intr) clssc-CS- clssc-SC-m clssc-SC-s clsst-CS-m clsst-CS-s
## classc-CS-s -0.688                                                      
## classc-SC-m -0.707  0.487                                               
## classc-SC-s -0.688  0.474     0.487                                     
## classt-CS-m -0.688  0.474     0.487      0.474                          
## classt-CS-s -0.642  0.442     0.454      0.442      0.442               
## classt-SC-m -0.688  0.474     0.487      0.474      0.474      0.442    
## classt-SC-s -0.688  0.474     0.487      0.474      0.474      0.442    
##             clsst-SC-m
## classc-CS-s           
## classc-SC-m           
## classc-SC-s           
## classt-CS-m           
## classt-CS-s           
## classt-SC-m           
## classt-SC-s  0.474

A one-way ANOVA with Type III sums of squares is used to compare the groups (since the groups contain unequal numbers of animals):

BDNF_anova <- Anova(mod_BDNF, type = "III")
BDNF_anova
## Analysis of Deviance Table (Type III Wald chisquare tests)
## 
## Response: BDNF_N
##               Chisq Df Pr(>Chisq)    
## (Intercept) 927.214  1    < 2e-16 ***
## class        15.513  7    0.02995 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The expression level of the BDNF protein depends significantly on the group of mice (p = 0.02995, df = 7).

2.2 Model diagnostics

Data for the analysis of the residuals of our model is produced by the fortify function. The boxplot below lets us check whether the standard deviation of the residuals is constant across groups (constant spread indicates homoscedasticity; unequal spread indicates heteroscedasticity).

mod_diag <- fortify(mod_BDNF)

ggplot(mod_diag, aes(x = class, y = .scresid)) + geom_boxplot() + ggtitle("Scaled residuals by class")

There is only a little variation in the standard deviations between groups, which suggests that our model is appropriate.
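As an additional, optional diagnostic (a sketch), we can check the normality of the scaled residuals with a quantile-quantile plot:

# Q-Q plot of the scaled residuals; points should lie close to the line
ggplot(mod_diag, aes(sample = .scresid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Q-Q plot of scaled residuals")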

2.3 Post-hoc tests

Having found a significant factor, we can apply Tukey contrasts to determine which groups differ significantly in BDNF expression level (df = 72).

post_hoch <- glht(mod_BDNF, linfct = mcp(class = "Tukey"))
result <- summary(post_hoch)
## Warning in RET$pfunction("adjusted", ...): Completion with error > abseps
result
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lmer(formula = BDNF_N ~ class + (1 | MouseID), data = df)
## 
## Linear Hypotheses:
##                        Estimate Std. Error z value Pr(>|z|)  
## c-CS-s - c-CS-m == 0  0.0030979  0.0161862   0.191   1.0000  
## c-SC-m - c-CS-m == 0 -0.0482717  0.0157544  -3.064   0.0452 *
## c-SC-s - c-CS-m == 0 -0.0258249  0.0161862  -1.595   0.7527  
## t-CS-m - c-CS-m == 0 -0.0264852  0.0161862  -1.636   0.7276  
## t-CS-s - c-CS-m == 0 -0.0337570  0.0173606  -1.944   0.5193  
## t-SC-m - c-CS-m == 0 -0.0181541  0.0161862  -1.122   0.9522  
## t-SC-s - c-CS-m == 0 -0.0131792  0.0161862  -0.814   0.9924  
## c-SC-m - c-CS-s == 0 -0.0513696  0.0161862  -3.174   0.0324 *
## c-SC-s - c-CS-s == 0 -0.0289228  0.0166066  -1.742   0.6591  
## t-CS-m - c-CS-s == 0 -0.0295831  0.0166066  -1.781   0.6323  
## t-CS-s - c-CS-s == 0 -0.0368549  0.0177533  -2.076   0.4299  
## t-SC-m - c-CS-s == 0 -0.0212520  0.0166066  -1.280   0.9063  
## t-SC-s - c-CS-s == 0 -0.0162771  0.0166066  -0.980   0.9772  
## c-SC-s - c-SC-m == 0  0.0224468  0.0161862   1.387   0.8634  
## t-CS-m - c-SC-m == 0  0.0217865  0.0161862   1.346   0.8808  
## t-CS-s - c-SC-m == 0  0.0145147  0.0173606   0.836   0.9910  
## t-SC-m - c-SC-m == 0  0.0301176  0.0161862   1.861   0.5774  
## t-SC-s - c-SC-m == 0  0.0350925  0.0161862   2.168   0.3708  
## t-CS-m - c-SC-s == 0 -0.0006603  0.0166066  -0.040   1.0000  
## t-CS-s - c-SC-s == 0 -0.0079321  0.0177533  -0.447   0.9998  
## t-SC-m - c-SC-s == 0  0.0076708  0.0166066   0.462   0.9998  
## t-SC-s - c-SC-s == 0  0.0126457  0.0166066   0.761   0.9949  
## t-CS-s - t-CS-m == 0 -0.0072718  0.0177533  -0.410   0.9999  
## t-SC-m - t-CS-m == 0  0.0083311  0.0166066   0.502   0.9997  
## t-SC-s - t-CS-m == 0  0.0133060  0.0166066   0.801   0.9931  
## t-SC-m - t-CS-s == 0  0.0156029  0.0177533   0.879   0.9879  
## t-SC-s - t-CS-s == 0  0.0205778  0.0177533   1.159   0.9431  
## t-SC-s - t-SC-m == 0  0.0049749  0.0166066   0.300   1.0000  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

BDNF expression is significantly higher in control mice stimulated to learn and injected with memantine (c-CS-m) than in control mice not stimulated to learn and injected with memantine (c-SC-m) (z value = -3.064, adjusted p = 0.0452), indicating a significant effect of training in memantine-injected controls. Likewise, BDNF expression is significantly higher in control mice stimulated to learn and injected with saline (c-CS-s) than in control mice not stimulated to learn and injected with memantine (c-SC-m) (z value = -3.174, adjusted p = 0.0324).
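The same contrasts can also be inspected graphically; multcomp provides a plot method for the family-wise confidence intervals of a glht object (a sketch):

# 95% family-wise confidence intervals for all pairwise contrasts;
# intervals that do not cross zero correspond to significant differences
par(mar = c(4, 9, 3, 1))  # widen the left margin for the contrast labels
plot(confint(post_hoch))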

2.4 Visualisation

We predicted values from our model using the model matrix, then computed standard errors and approximate 95% confidence intervals (fit ± 2 SE), and visualised the result:

df_post_hoc <- data.frame(class = factor(levels(df$class), levels = levels(df$class)))

# model matrix
X <- model.matrix(~ class,
                  data = df_post_hoc)
betas = fixef(mod_BDNF)
df_post_hoc$fit <- X %*% betas

# SE and CI
df_post_hoc$se <- sqrt( diag(X %*% vcov(mod_BDNF) %*% t(X)) )
df_post_hoc$lwr <- df_post_hoc$fit - 2 * df_post_hoc$se
df_post_hoc$upr <- df_post_hoc$fit + 2 * df_post_hoc$se

ggplot(data = df_post_hoc, aes(x = class, y = fit)) +
  geom_bar(stat = "identity", aes(fill = class), width = 0.5) +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
  ggtitle(label = "BDNF expression level in mice (Tukey Contrasts)") + 
  ylab(label = "BDNF expression") +
  geom_text(aes(y = 0.25, label = c("*", "*", "*", "", "", "", "", "")))
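The manually computed fitted means and intervals can be cross-checked with the emmeans package (an optional sketch; emmeans is assumed to be installed and is not used elsewhere in this report):

library(emmeans)
# estimated marginal mean BDNF level per class, with confidence intervals,
# computed directly from the fitted mixed model
emmeans(mod_BDNF, ~ class)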

3 ERBB4 level prediction

3.1 Linear mixed-effects model (LMM)

It is known that some proteins may share expression patterns, especially proteins from the same pathway. In order to check whether the expression level of the ERBB4 protein depends on the expression levels of other proteins, we fitted a linear mixed-effects model. We selected proteins known to be associated with Alzheimer's disease: CaNA, CDK5, P3525, Tau, SNCA, IL1B, and nNOS. MouseID is again used as a random factor.

df_AD <- df %>% 
  dplyr::select(MouseID, MouseID_sample, Genotype, Treatment, Behavior, class, CaNA_N, CDK5_N, P3525_N, ERBB4_N, Tau_N, SNCA_N, IL1B_N, nNOS_N)

mod_AD <- lmer(ERBB4_N ~ CaNA_N + CDK5_N + P3525_N + Tau_N + SNCA_N + IL1B_N + nNOS_N + class + (1 | MouseID), data = df_AD)
summary(mod_AD)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ERBB4_N ~ CaNA_N + CDK5_N + P3525_N + Tau_N + SNCA_N + IL1B_N +  
##     nNOS_N + class + (1 | MouseID)
##    Data: df_AD
## 
## REML criterion at convergence: -7614.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3426 -0.6315 -0.0035  0.5712  5.2619 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  MouseID  (Intercept) 4.592e-05 0.006776
##  Residual             3.719e-05 0.006098
## Number of obs: 1080, groups:  MouseID, 72
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  0.039507   0.004496   8.787
## CaNA_N       0.011378   0.002346   4.849
## CDK5_N       0.002981   0.008611   0.346
## P3525_N      0.130700   0.015859   8.241
## Tau_N        0.055392   0.008156   6.792
## SNCA_N      -0.019825   0.017789  -1.114
## IL1B_N       0.049773   0.006182   8.051
## nNOS_N       0.151747   0.016090   9.431
## classc-CS-s -0.001801   0.003256  -0.553
## classc-SC-m  0.003590   0.003415   1.051
## classc-SC-s  0.003725   0.003391   1.098
## classt-CS-m -0.007691   0.003218  -2.390
## classt-CS-s -0.003393   0.003476  -0.976
## classt-SC-m  0.012354   0.003484   3.546
## classt-SC-s -0.003479   0.003339  -1.042
## 
## Correlation matrix not shown by default, as p = 15 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

3.2 Selection of significant predictors

Next we applied likelihood-ratio tests (drop1 with test = 'Chi') to retain only significant predictors. At this step we found that CDK5_N and SNCA_N are not significant (p > 0.05) and removed them one at a time:

drop1(mod_AD, test = 'Chi')
## Single term deletions
## 
## Model:
## ERBB4_N ~ CaNA_N + CDK5_N + P3525_N + Tau_N + SNCA_N + IL1B_N + 
##     nNOS_N + class + (1 | MouseID)
##         npar     AIC    LRT   Pr(Chi)    
## <none>       -7718.3                     
## CaNA_N     1 -7696.4 23.961 9.830e-07 ***
## CDK5_N     1 -7720.1  0.203    0.6527    
## P3525_N    1 -7653.0 67.283 2.352e-16 ***
## Tau_N      1 -7674.6 45.713 1.369e-11 ***
## SNCA_N     1 -7719.0  1.304    0.2535    
## IL1B_N     1 -7656.0 64.335 1.049e-15 ***
## nNOS_N     1 -7634.3 85.988 < 2.2e-16 ***
## class      7 -7699.5 32.853 2.820e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod_AD1 <- update(mod_AD, .~.-CDK5_N)
drop1(mod_AD1, test = 'Chi')
## Single term deletions
## 
## Model:
## ERBB4_N ~ CaNA_N + P3525_N + Tau_N + SNCA_N + IL1B_N + nNOS_N + 
##     class + (1 | MouseID)
##         npar     AIC    LRT   Pr(Chi)    
## <none>       -7720.1                     
## CaNA_N     1 -7696.3 25.786 3.814e-07 ***
## P3525_N    1 -7653.0 69.164 < 2.2e-16 ***
## Tau_N      1 -7676.0 46.086 1.132e-11 ***
## SNCA_N     1 -7720.8  1.305    0.2532    
## IL1B_N     1 -7657.1 64.977 7.577e-16 ***
## nNOS_N     1 -7631.9 90.233 < 2.2e-16 ***
## class      7 -7701.4 32.771 2.921e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod_AD2 <- update(mod_AD1, .~.-SNCA_N)
drop1(mod_AD2, test = 'Chi')
## Single term deletions
## 
## Model:
## ERBB4_N ~ CaNA_N + P3525_N + Tau_N + IL1B_N + nNOS_N + class + 
##     (1 | MouseID)
##         npar     AIC     LRT   Pr(Chi)    
## <none>       -7720.8                      
## CaNA_N     1 -7690.6  32.195 1.394e-08 ***
## P3525_N    1 -7630.4  92.432 < 2.2e-16 ***
## Tau_N      1 -7675.4  47.382 5.842e-12 ***
## IL1B_N     1 -7656.9  65.932 4.668e-16 ***
## nNOS_N     1 -7612.1 110.759 < 2.2e-16 ***
## class      7 -7702.9  31.909 4.222e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the expression level of the ERBB4 protein significantly depends on the expression levels of the CaNA, P3525, Tau, IL1B, and nNOS proteins.

3.3 Diagnostics of model

First we need to obtain data for the analysis of the residuals of our model. To do that we use the fortify function:

mod2_diag <- fortify(mod_AD2)
head(mod2_diag)
## # A tibble: 6 x 17
## # Groups:   MouseID [1]
##   MouseID MouseID_sample Genotype Treatment Behavior class CaNA_N CDK5_N P3525_N
##   <chr>   <chr>          <fct>    <fct>     <fct>    <fct>  <dbl>  <dbl>   <dbl>
## 1 309     1              Control  Memantine C/S      c-CS…   1.68  0.295   0.248
## 2 309     2              Control  Memantine C/S      c-CS…   1.74  0.294   0.258
## 3 309     3              Control  Memantine C/S      c-CS…   1.93  0.302   0.255
## 4 309     4              Control  Memantine C/S      c-CS…   1.70  0.296   0.251
## 5 309     5              Control  Memantine C/S      c-CS…   1.84  0.297   0.252
## 6 309     6              Control  Memantine C/S      c-CS…   1.82  0.288   0.244
## # … with 8 more variables: ERBB4_N <dbl>, Tau_N <dbl>, SNCA_N <dbl>,
## #   IL1B_N <dbl>, nNOS_N <dbl>, .fitted <dbl>, .resid <dbl>, .scresid <dbl>

3.3.1 Residuals vs fitted values

These plots show whether there is a non-linear pattern between the variables and whether the standard deviation of the residuals is constant (i.e., whether there is heteroscedasticity).

gg_resid_factors <- ggplot(data = mod2_diag, aes(x = .fitted, y = .scresid)) +
  geom_boxplot() +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")

gg_resid <- ggplot(data = mod2_diag, aes(x = .fitted, y = .scresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) + 
  geom_smooth(method = 'lm') +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")

res_1 <- gg_resid_factors + aes(x = class)
res_2 <- gg_resid + aes(x = CaNA_N)
res_3 <- gg_resid + aes(x = P3525_N)
res_4 <- gg_resid + aes(x = Tau_N)
res_5 <- gg_resid + aes(x = IL1B_N)
res_6 <- gg_resid + aes(x = nNOS_N)

resit_predict_in <- grid.arrange(res_1, res_2, res_3, res_4, res_5, res_6, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


There are some outliers and a slight pattern in the Tau_N plot, which may indicate heteroscedasticity. However, these deviations are not critical, and we assume that our model meets the assumptions of linear regression analysis.

3.4 Visualisation of group influence

Finally, we created a barplot of the predicted ERBB4 expression level for each level of the class variable, holding the other predictors at their mean values:

df_post_hoc <- data.frame(class = factor(levels(df_AD$class), levels = levels(df_AD$class)),
                          CaNA_N = mean(df_AD$CaNA_N),
                          P3525_N = mean(df_AD$P3525_N),
                          Tau_N = mean(df_AD$Tau_N),
                          IL1B_N = mean(df_AD$IL1B_N),
                          nNOS_N = mean(df_AD$nNOS_N))
df_post_hoc
##    class   CaNA_N   P3525_N     Tau_N    IL1B_N    nNOS_N
## 1 c-CS-m 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 2 c-CS-s 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 3 c-SC-m 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 4 c-SC-s 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 5 t-CS-m 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 6 t-CS-s 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 7 t-SC-m 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
## 8 t-SC-s 1.337784 0.2912763 0.2104892 0.5273487 0.1813001
# model matrix
X <- model.matrix(~ class + CaNA_N + P3525_N + Tau_N + IL1B_N + nNOS_N,
                  data = df_post_hoc)

betas = fixef(mod_AD2)
df_post_hoc$fit <- X %*% betas

# SE and CI
df_post_hoc$se <- sqrt( diag(X %*% vcov(mod_AD2) %*% t(X)) )
df_post_hoc$lwr <- df_post_hoc$fit - 2 * df_post_hoc$se
df_post_hoc$upr <- df_post_hoc$fit + 2 * df_post_hoc$se

ggplot(data = df_post_hoc, aes(x = class, y = fit)) +
  geom_bar(stat = "identity", aes(fill = class), width = 0.5) +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1)

We consider our model appropriate, and we can indeed predict the expression level of a protein from the expression levels of other proteins. However, the predictors should be chosen carefully, for example proteins with a shared expression pattern.

4 PCA

Principal component analysis was performed using the package “vegan”.

4.1 Ordination

mouse_pca <- rda(df[, -c(1,2,80,81,82,83)], scale = TRUE)

4.2 Biplot

Biplots were constructed with scaling for species scores (variables, i.e. proteins) and for site scores (samples):

biplot(mouse_pca, scaling = "species", display = "species")

biplot(mouse_pca, scaling = "sites", display = "sites")

4.3 Proportion Explained

To find out how much variability is explained by each of the PCs, we draw several plots.

4.3.1 Screeplot

On this plot we can see that the ordination line (black) crosses the broken stick line (red) only at PC7, which means that PC1-PC6 are the most important and explain a large part of the variability.

screeplot(mouse_pca, type = "lines", bstick = TRUE)
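To quantify this, we can compute the cumulative proportion of variance explained by the first six components (a sketch using vegan's eigenvals):

# eigenvalues of the PCA; their normalised cumulative sum gives the
# proportion of total variance explained by the first k components
eig <- eigenvals(mouse_pca)
cumsum(eig / sum(eig))[1:6]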

4.3.2 Barplot

The same is shown as a barplot:

pca_summary <- summary(mouse_pca)
pca_result <- as.data.frame(pca_summary$cont)
plot_data <- as.data.frame(t(pca_result[c("Proportion Explained"),]))
plot_data$component <- c(1:76)

ggplot(plot_data, aes( component, `Proportion Explained`)) + 
  geom_bar(stat = "identity") + 
  theme_bw()

4.4 Ordination plot

We can construct an ordination plot in PC1-PC2 dimensions:

df_scores <- data.frame(df,
                        scores(mouse_pca, display = "sites", choices = c(1, 2, 3), scaling = "sites"))

p_scores <- ggplot(df_scores, aes(x = PC1, y = PC2)) + 
  geom_point(aes( color = class), alpha = 0.5) +
  coord_equal(xlim = c(-1.2, 1.2), ylim = c(-1.2, 1.2)) + ggtitle(label = "Ordination") + theme_bw()
p_scores

4.5 3D ordination plot

Alternatively, we can draw a 3D ordination plot using plotly:

fig <- plot_ly(df_scores, x = ~PC1, y = ~PC2, z = ~PC3, color = ~class, size = 0.5)
fig <- fig %>% add_markers()
fig

5 Finding differentially expressed proteins with sPLS-DA

To identify the specific proteins responsible for the main differences between groups, we applied sPLS-DA. First we tuned the appropriate number of principal components and the number of variables to keep per component:

list.keepX <- c(5:10,  seq(20, 100, 10))
set.seed(30) 
tune.splsda.srbct <- tune.splsda(df[, -c(1,2,80,81,82,83)], df$class, ncomp = 7, # ncomp is set to the number of groups minus 1
                                 validation = 'Mfold',
                                 folds = 3, dist = 'max.dist', progressBar = FALSE,
                                 measure = "BER", test.keepX = list.keepX,
                                 nrepeat = 10)   
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp 
ncomp
## [1] 6
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] 
select.keepX
## comp1 comp2 comp3 comp4 comp5 comp6 
##     8     7     5    30    70     6

We next performed the analysis with the tuned parameters:

MyResult.splsda.fixed <- splsda(df[, -c(1,2,80,81,82,83)], df$class, ncomp = ncomp, keepX = select.keepX)

Finally we visualised our result:

layout(matrix(c(1, 2, 3, 3, 3, 3), 2, 3))
plotLoadings(MyResult.splsda.fixed, comp = 1, size.name = 1, size.title = 1.2, title = "Loadings\n on 1st component", contrib = "max", legend = FALSE, ndisplay = 10)
## 'ndisplay' value is larger than the number of selected variables! It has been reseted to 8 for block X
plotLoadings(MyResult.splsda.fixed, comp = 2, size.name = 1, size.title = 1.2, title = "Loadings\n on 2nd component", contrib = "max",ndisplay = 10,  legend = FALSE)
## 'ndisplay' value is larger than the number of selected variables! It has been reseted to 7 for block X
plotIndiv(MyResult.splsda.fixed, ind.names = F, ellipse = T, style = "graphics", abline = TRUE, cex = 2, pch = 19, size.axis = 1.2, size.xlabel = 1.5, size.ylabel = 1.5, title = "sPLS-DA ordination of samples", size.title = 1.5)
legend("bottomleft", legend = levels(df$class), cex = 1, fill = color.mixo(1:8), bty = "n")

On this plot we can see the differentially expressed proteins that drive the differences between the groups.
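The selected proteins can also be listed directly; mixOmics' selectVar returns the variables retained on a given component (a sketch):

# names of the proteins selected on the first two sPLS-DA components
selectVar(MyResult.splsda.fixed, comp = 1)$name
selectVar(MyResult.splsda.fixed, comp = 2)$name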